Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added support to read single parquet file hosted in AWS S3 #4972

Merged
merged 41 commits into from
Jan 29, 2024

Conversation

malhotrashivam
Copy link
Contributor

@malhotrashivam malhotrashivam commented Dec 21, 2023

@malhotrashivam malhotrashivam added core Core development tasks parquet Related to the Parquet integration DocumentationNeeded ReleaseNotesNeeded Release notes are needed labels Dec 21, 2023
@malhotrashivam malhotrashivam added this to the December 2023 milestone Dec 21, 2023
@malhotrashivam malhotrashivam self-assigned this Dec 21, 2023
- Added an async response transfer to save on copies
- Moved HEAD request to constructor
- Saved on one copy while reading
- Using a CRT based client
- Merged the two read channels
- Updated caffeine version
- Added a shared cache between channels
- Removed delete old entries from cache logic
- Made cache size configurable
@malhotrashivam malhotrashivam force-pushed the sm-s3-pq branch 5 times, most recently from dfdbad0 to dab479f Compare December 29, 2023 11:04
py/server/deephaven/experimental/s3.py Outdated Show resolved Hide resolved
py/server/deephaven/experimental/s3.py Outdated Show resolved Hide resolved
py/server/deephaven/experimental/s3.py Outdated Show resolved Hide resolved
py/server/deephaven/experimental/s3.py Outdated Show resolved Hide resolved
py/server/deephaven/experimental/s3.py Outdated Show resolved Hide resolved
py/server/deephaven/experimental/s3.py Outdated Show resolved Hide resolved
py/server/deephaven/parquet.py Outdated Show resolved Hide resolved
py/server/deephaven/parquet.py Outdated Show resolved Hide resolved
py/server/deephaven/parquet.py Show resolved Hide resolved
py/server/deephaven/experimental/s3.py Show resolved Hide resolved
Copy link
Member

@rcaudy rcaudy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very close now.

py/server/deephaven/experimental/s3.py Show resolved Hide resolved
from deephaven.column import Column
from deephaven.dtypes import DType
from deephaven.table import Table
from deephaven.experimental import s3
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an interesting case. We have experimental bleeding into our normal API. I may be ok with this, but @jmao-denver should weigh in. I may be ok with it because it is obvious it is experimental, so yanking the rug later could be ok.

Do you expect the API to change going forward? If not, it could be promoted out of experimental. If it needs baking time, leave it there.

Copy link
Contributor Author

@malhotrashivam malhotrashivam Jan 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does need more baking time, especially since we will be adding more capabilities to it very soon. Right now, we only support reading a single parquet file. Going forward, we want to support reading partitioned data, metadata files, etc.

Copy link
Contributor

@jmao-denver jmao-denver Jan 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we allow experimental feature to bleed into the normal API, maybe a factory/helper function s3_instructions(...) is more desirable than forcing the user to use experimenta.s3.S3Instructions directly? At least, when we decide to graduate S3Instructions from experimental, the user doesn't need to change the code, even though code change comes with the territory of using experimental features. But I am OK with it as is. @chipkent your thoughts?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd probably lean to what is currently here because it makes very explicit that it is experimental. I would kind of rather force them to change the code because it also forces them to know it is experimental.

rcaudy
rcaudy previously approved these changes Jan 26, 2024
Copy link
Contributor

@jmao-denver jmao-denver left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor code style issue found and a possible interface change suggestion.

py/server/deephaven/experimental/s3.py Outdated Show resolved Hide resolved
from deephaven.column import Column
from deephaven.dtypes import DType
from deephaven.table import Table
from deephaven.experimental import s3
Copy link
Contributor

@jmao-denver jmao-denver Jan 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we allow experimental feature to bleed into the normal API, maybe a factory/helper function s3_instructions(...) is more desirable than forcing the user to use experimenta.s3.S3Instructions directly? At least, when we decide to graduate S3Instructions from experimental, the user doesn't need to change the code, even though code change comes with the territory of using experimental features. But I am OK with it as is. @chipkent your thoughts?

Copy link
Contributor

@jmao-denver jmao-denver left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@malhotrashivam malhotrashivam merged commit db04b57 into deephaven:main Jan 29, 2024
19 checks passed
@github-actions github-actions bot locked and limited conversation to collaborators Jan 29, 2024
@deephaven-internal
Copy link
Contributor

Labels indicate documentation is required. Issues for documentation have been opened:

Community: https://github.com/deephaven/deephaven.io/issues/3658

@malhotrashivam
Copy link
Contributor Author

Will add a separate PR if we need more changes around the python interface.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
core Core development tasks DocumentationNeeded parquet Related to the Parquet integration ReleaseNotesNeeded Release notes are needed s3
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants